Jasper Slingsby
Variables are latent if they are unobserved or estimated with uncertainty. Their true value is not known and can only be inferred indirectly through a model from other variables that can be directly observed or measured.
Dietze 2017 outlines four common situations involving latent variables: observation error, observing proxies, missing data, and variables that are never observed directly.
I’ve already mentioned that a big challenge to modelling is error in the observation of the state variable of interest.
Observation errors are typically either:
random, due to imprecision of the data collection process or other extraneous factors, or
systematic, implying there is some bias
Imprecision in measurement creates random error.
Inaccuracy creates systematic error.
Random error is most commonly created by imprecision in measurement (“scatter”) around the true state of the variable of interest, but can be created by other processes.
In this case we may want to model the true state as a latent variable, and model the random observation error as a probability distribution (typically Gaussian) with mean of 0 (or the mean of the process model).
E.g. specifying the data model to be a normal distribution, as we did in the post-fire recovery model:
\[\begin{gather} NDVI_{i,t}\sim \mathit{N}(\mu_{i,t},\frac{1}{\sqrt{\tau}}) \\ \end{gather}\]
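To see what this precision parameterization means in practice, here's a minimal Python sketch (the values are invented for illustration: a true NDVI of \(\mu = 0.6\) and precision \(\tau = 25\)). Simulated observations should scatter around the latent true state with standard deviation \(1/\sqrt{\tau}\):

```python
import numpy as np

# The data model N(mu, 1/sqrt(tau)) is parameterized by the precision tau,
# so the standard deviation of the observation error is 1/sqrt(tau).
rng = np.random.default_rng(42)

mu = 0.6        # hypothetical "true" NDVI from the process model
tau = 25.0      # hypothetical precision; sd = 1/sqrt(25) = 0.2
sigma = 1.0 / np.sqrt(tau)

# Simulate noisy NDVI observations around the latent true state
ndvi_obs = rng.normal(loc=mu, scale=sigma, size=10_000)

print(round(ndvi_obs.mean(), 2), round(ndvi_obs.std(), 2))
```

Here a precision of 25 corresponds to an observation error standard deviation of 0.2, and the simulated observations recover both the mean and the scatter.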
Systematic error is where there is a bias, such as created by differences among observers or poor instrument calibration.
Constant bias can be corrected with an offset, but something like sensor drift may need to be approximated as a random walk or similar (to account for temporal autocorrelation).
If we have more information about the causes of error, we can apply more complex observation models (e.g. differences among field staff, etc).
Often there is both random and systematic error, requiring a model that accounts for both.
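For example, a minimal Python simulation (all numbers invented) of an observation process with both a constant bias and random scatter, where the bias is estimated from hypothetical calibration measurements of a known standard and removed as an offset:

```python
import numpy as np

rng = np.random.default_rng(1)

true_state = 10.0   # hypothetical latent true value
bias = 0.8          # systematic error (e.g. poor instrument calibration)
sigma = 0.5         # sd of the random observation error

# Observations = truth + constant bias + random noise
obs = true_state + bias + rng.normal(0.0, sigma, size=5_000)

# A constant bias can be corrected with an offset, here estimated from
# hypothetical calibration measurements of a known standard
standard = 10.0
calib = standard + bias + rng.normal(0.0, sigma, size=100)
offset = calib.mean() - standard

corrected = obs - offset
print(round(obs.mean(), 2), round(corrected.mean(), 2))
```

The raw observations centre on 10.8 rather than 10, but subtracting the estimated offset recovers the true state (with the random scatter remaining).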
I.e. observing a proxy for the state variable of interest, e.g. counting dung as a proxy for animal abundance.
There are many ways to relate the observed proxy(ies) to the latent state variable of interest, such as empirical calibration curves, probabilities of identifying dung correctly, etc.
Where some observations may be missing from the data, these may be estimated with uncertainty in various ways.
Missing data are common in time series or in space (e.g. sensor failure, logistical difficulties, etc.).
Some variables may never be observed (e.g. they are too difficult to measure), but can be inferred from the process model.
Estimating these latent variables can be tricky, but having multiple independent measures to constrain the estimates or high confidence in the model structure (i.e. mechanistic understanding) can help.
Forecasting involves predicting key variables further ahead in time, and often farther away in space.
It usually has to deal with a number of latent variables due to missing or sparse data, observation error, etc, and these are often connected in time and/or space (i.e. autocorrelated).
State-space models are a useful framework for dealing with these kinds of problems and for forecasting in general.
For time series of discrete state variables (i.e. a categorical response) they are also referred to as Hidden Markov models.
When extended to spatial or space-time settings they are called Markov random fields.
The name comes from the focus on estimating latent state variables.
In doing so, they explicitly separate observation errors from process errors.
Autocorrelation…
Illustration of a simple univariate state-space model from Auger-Methe et al. 2021.
Once the dependence of the observations \(y_t\) on the states \(z_t\) is accounted for, the observations are assumed to be independent.
A toy model demonstrating that the state estimates (mean and 95% confidence interval - black line and grey band respectively) can be a closer approximation of the true states (red dots) than the observations (blue dots).
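The idea behind such a toy model can be sketched with a simple Kalman filter, assuming a random-walk state and known process and observation variances (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

n = 200
q, r = 0.1**2, 1.0**2   # process and observation error variances (assumed known)

# Simulate latent states (a random walk) and noisy observations of them
z = np.cumsum(rng.normal(0, np.sqrt(q), n))   # true states z_t
y = z + rng.normal(0, np.sqrt(r), n)          # observations y_t

# Kalman filter: estimate z_t from the observations y_1..y_t
zhat = np.zeros(n)
m, p = 0.0, 1.0     # initial mean and variance of the state estimate
for t in range(n):
    p = p + q                   # predict: uncertainty grows by the process variance
    k = p / (p + r)             # Kalman gain: weight given to the new observation
    m = m + k * (y[t] - m)      # update the state estimate with observation y_t
    p = (1 - k) * p             # update the estimate's variance
    zhat[t] = m

rmse_obs = np.sqrt(np.mean((y - z) ** 2))
rmse_filt = np.sqrt(np.mean((zhat - z) ** 2))
print(round(rmse_obs, 2), round(rmse_filt, 2))
```

Because the filter pools information across time, the filtered state estimates sit much closer to the true states than the raw observations do.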
Before I can introduce Bayes, there are a few basic building blocks we need to establish first.
Traditional parametric statistics like regression analysis and analysis of variance (ANOVA) rely on Ordinary Least Squares (OLS).
There are other flavours of least squares that allow more flexibility (e.g. nonlinear least squares (NLS), which we used in the practical, partial least squares, etc.), but I’m not going to go into these distinctions.
In general, least squares approaches “fit” (i.e. estimate the parameters of) models by minimizing the sums of the squared residuals.
The model (blue line) is drawn through the points to minimize the sum of the squared vertical (y-axis) differences between each point and the regression line (i.e. residuals).
Redrawn highlighting the residuals:
A histogram of the residuals:
The residuals approximate a normal distribution
This is a key assumption when using Least Squares
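To make this concrete, here is a small Python sketch (simulated data with made-up coefficients) that fits a line by minimizing the sum of squared residuals, using the closed-form normal equations, and checks the residuals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 3x + Gaussian noise (values are illustrative)
x = rng.uniform(0, 10, 500)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, 500)

# Least squares chooses the intercept and slope minimizing the sum of
# squared residuals; for a linear model this has the closed-form
# "normal equations" solution (X'X) beta = X'y
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta
print(np.round(beta, 1), round(residuals.mean(), 4))
```

The fitted intercept and slope recover the simulated values, and (as always for least squares with an intercept) the residuals average exactly zero.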
1. Least Squares doesn’t explicitly include a data model
It’s useful at this stage to make a distinction between data models and process models.
The process model is the bit you’ll be used to, where we describe how the model creates a prediction for a particular set of inputs or covariates (e.g. a linear model)
The data model describes the residuals (i.e. mismatch between the process model and the data)
Least Squares analyses don’t explicitly include a data model: minimizing the sums of squares means the data model in a Least Squares analysis can only ever be a normal distribution with constant variance (i.e. there must be homogeneity of variance).
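One way to see this equivalence is a quick Python sketch (simulated data, invented values): the slope that minimizes the sum of squares is the same slope that maximizes a Gaussian likelihood:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: y = 1 + 2.5x + Gaussian noise (values are illustrative)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.5 * x + rng.normal(0, 1.0, 200)

# Least squares slope (closed form, with intercept)
b_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Gaussian log-likelihood of the data for a candidate slope b
# (with the best intercept for that slope, and sigma fixed at 1)
def loglik(b, sigma=1.0):
    a = y.mean() - b * x.mean()
    mu = a + b * x
    return np.sum(-0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

# Maximize the likelihood over a grid of candidate slopes
grid = np.linspace(2.0, 3.0, 2001)
b_mle = grid[np.argmax([loglik(b) for b in grid])]

print(round(b_ols, 3), round(b_mle, 3))
```

The two estimates agree (up to the grid resolution), because with normal, constant-variance errors, maximizing the likelihood and minimizing the squared residuals are the same optimization.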
(Compare the least squares predictions from the practical (pred.negexp/S) with the maximum likelihood fit (fit.negexp/S.MLE), which includes the data model.)

2. Least squares focuses on what the parameters are not, rather than what they are
Least Squares focuses on null hypothesis testing - the ability to reject (or to fail to reject) the null hypothesis at some threshold of significance (alpha; usually P < 0.05)
For a linear model, the null hypothesis is that the slope = 0 (i.e. no effect of X on Y)
A linear model is only considered useful when you can reject the null hypothesis.
It tells you nothing about the probability of the parameters actually being the estimates you arrived at by minimizing the sums of squares!
The likelihood principle = a parameter value is more likely than another if it is the one for which the data are more probable.
Maximum Likelihood Estimation (MLE) = a method for estimating model parameters by applying the likelihood principle. It optimizes parameter values to maximize the likelihood that the process described by the model produced the data observed.
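A minimal illustration in Python (hypothetical count data, e.g. individuals per plot): the rate that maximizes the Poisson likelihood of the observed counts turns out to be the sample mean:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical count data, truly drawn from Poisson(4)
counts = rng.poisson(lam=4.0, size=1_000)

# Poisson log-likelihood of the data for a candidate rate lam:
# log prod( lam^k e^-lam / k! ); the k! term is constant in lam, so drop it
def loglik(lam):
    return np.sum(counts * np.log(lam) - lam)

# Maximize over a grid of candidate rates
grid = np.linspace(2.0, 6.0, 4001)
lam_mle = grid[np.argmax([loglik(l) for l in grid])]

print(round(lam_mle, 2), round(counts.mean(), 2))
```

The grid search lands on (essentially) the sample mean, which is the analytical maximum likelihood estimate for a Poisson rate.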
Viewed differently…
With MLE we assume we have the correct model and use MLE to choose parameter values that maximize the conditional probability of the data given those parameter values, \(P(Data|Parameters)\).
BUT!
What we really want to know is the conditional probability of the parameters given the data, \(P(Parameters|Data)\), because this allows us to express uncertainty in the parameter estimates as probabilities.
The interactions between two random variables are illustrated below…
Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).
The joint probability, \(P(x,y)\), is the probability of both \(x\) and \(y\) occurring simultaneously
This is the probability of occurring within the overlap of the circles = 3/10
Needless to say, the joint probability \(P(y,x)\) is identical to \(P(x,y)\) (= 3/10)
We can also define two conditional probabilities: the probability of \(x\) given that \(y\) has occurred, \(P(x|y)\), and the probability of \(y\) given that \(x\) has occurred, \(P(y|x)\).
In the diagram these would be \(P(x|y) = 3/6\) and \(P(y|x) = 3/7\).
Last, we can define two marginal probabilities for \(x\) and \(y\).
These are just the separate probabilities of being in set \(x\) or in set \(y\) out of the full set of events, i.e. \(P(x) = 7/10\) and \(P(y) = 6/10\).
We can also show that the joint probabilities are the product of the conditional and marginal probabilities:
\[P(x,y) = P(x|y)\,P(y) = 3/6 \times 6/10 = 0.3\]
and:
\[P(y,x) = P(y|x)\,P(x) = 3/7 \times 7/10 = 0.3\]
Tadaa!
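We can verify the arithmetic directly with a throwaway Python check, using the event counts from the diagram:

```python
from fractions import Fraction

# Ten events; 7 fall in set x, 6 in set y, 3 in the overlap (as in the diagram)
n, n_x, n_y, n_xy = 10, 7, 6, 3

p_x = Fraction(n_x, n)             # marginal P(x) = 7/10
p_y = Fraction(n_y, n)             # marginal P(y) = 6/10
p_x_given_y = Fraction(n_xy, n_y)  # conditional P(x|y) = 3/6
p_y_given_x = Fraction(n_xy, n_x)  # conditional P(y|x) = 3/7

# Both routes to the joint probability give the same answer
joint_1 = p_x_given_y * p_y
joint_2 = p_y_given_x * p_x
print(joint_1, joint_2)            # prints: 3/10 3/10
```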
This means we can derive what we want to know, \(P(Parameters|Data)\), as a function of the information maximum likelihood estimation provides, \(P(Data|Parameters)\).
…Let’s do the derivation…
Now let \(\theta\) = the model parameters and \(D\) = the observed data.
We’re interested in the conditional probability of the parameters given the data, \(p(\theta|D)\).
To get this, we need to take the previous equations and solve for \(p(\theta|D)\).
Since we know the joint probabilities are identical:
\[p(\theta,D) = p(\theta|D)p(D)\]
\[p(D,\theta) = p(D|\theta)p(\theta)\]
We can take the right hand side of the two:
\[p(\theta|D)p(D) = p(D|\theta)p(\theta)\]
and solve for \(p(\theta|D)\) as:
\[p(\theta|D) = \frac{p(D|\theta) \; p(\theta)}{p(D)} \;\;\]
which is known as Bayes’ Theorem!
Rewriting the terms on one line allows us to label them with their names:
\[ \underbrace{p(\theta|D)}_\text{posterior} \; = \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; / \; \underbrace{p(D)}_\text{evidence} \]
The marginal probability of the data, \(p(D)\), is now called the evidence for the model, and represents the overall probability of the data according to the model.
The evidence \(p(D)\) does not depend on the parameters, so it is just a normalizing constant. Getting rid of it allows us to focus on the important bits:
\[ \underbrace{p(\theta|D)}_\text{posterior} \; \propto \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; \]
Which reads “The posterior is proportional to the likelihood times the prior”.
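A minimal numerical illustration of this in Python (with a made-up example of 7 detections in 10 trials): approximate the posterior on a grid by multiplying likelihood and prior, then normalizing:

```python
import numpy as np

# Grid approximation of Bayes' theorem for a detection probability theta,
# after observing 7 successes in 10 trials (numbers are illustrative)
theta = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(theta)               # flat prior over theta
likelihood = theta**7 * (1 - theta)**3    # binomial kernel, P(D|theta)

posterior = likelihood * prior            # posterior is proportional to likelihood x prior
posterior /= posterior.sum()              # normalize (this plays the role of the evidence)

print(round(theta[np.argmax(posterior)], 2))   # posterior mode, approximately 7/10
```

With a flat prior the posterior simply follows the likelihood, peaking at the observed proportion of successes.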
This leaves us with three terms:
\(p(D|\theta)\) still represents the probability of the data given the model with parameter values \(\theta\), and is used in analyses to find the likelihood profiles of the parameters.
The prior represents the credibility of the parameter values, \(\theta\), without the data, \(D\).
“But how can we know anything about the parameter values without the data?”
By applying the scientific method, whereby we interrogate new evidence (the data) in the context of previous knowledge or information to update our understanding.
It is very easy to specify an inappropriate prior and bias the outcome of your analysis!
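A quick Python demonstration (grid approximation with a made-up example of 7 detections in 10 trials): a strong prior concentrated in the wrong place drags the posterior away from what the data support:

```python
import numpy as np

# Grid over theta; data are 7 detections in 10 trials (illustrative)
theta = np.linspace(0.001, 0.999, 999)
likelihood = theta**7 * (1 - theta)**3

def posterior_mean(prior):
    post = likelihood * prior
    post /= post.sum()
    return np.sum(theta * post)

flat = np.ones_like(theta)   # weakly informative prior

# A strong (and wrong) prior insisting theta is near 0.1
strong_wrong = np.exp(-0.5 * ((theta - 0.1) / 0.02) ** 2)

print(round(posterior_mean(flat), 2), round(posterior_mean(strong_wrong), 2))
```

Under the flat prior the posterior mean sits near the observed proportion (0.7); under the strong misplaced prior it is dragged down towards 0.1, despite the same data.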
We’ll get stuck into examples showing the value of Bayes in ecology in the next lecture.